quality dimension


PersonaMatrix: A Recipe for Persona-Aware Evaluation of Legal Summarization

Pang, Tsz Fung, Berijanian, Maryam, Orth, Thomas, Shi, Breanna, Alexander, Charlotte S.

arXiv.org Artificial Intelligence

Legal documents are often long, dense, and difficult to comprehend, not only for laypeople but also for legal experts. While automated document summarization has great potential to improve access to legal knowledge, prevailing task-based evaluators overlook divergent user and stakeholder needs. Evaluation tools are needed that can capture the technicality a litigator expects from a case summary while remaining accessible to self-represented members of the public researching their own lawsuits. We introduce PersonaMatrix, a persona-by-criterion evaluation framework that scores summaries through the lens of six personas, including legal and non-legal users. We also introduce a controlled, dimension-shifted pilot dataset of U.S. civil rights case summaries that varies along depth, accessibility, and procedural detail, as well as a Diversity-Coverage Index (DCI) to expose the divergent optima that persona-aware and persona-agnostic judges assign to legal summaries. This work enables the refinement of legal AI summarization systems for both expert and non-expert users, with the potential to increase access to legal knowledge. The code base and data are publicly available on GitHub.
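
A minimal sketch of the persona-by-criterion idea in Python, assuming the LLM judge is exposed as a callable returning a score in [0, 1]; the persona names, criteria, and the diversity-coverage aggregation below are illustrative assumptions, not the paper's exact rubric or DCI formula.

from statistics import mean

# Hypothetical persona and criterion lists (the paper defines six personas).
PERSONAS = ["litigator", "judge", "scholar", "journalist",
            "advocate", "self_help_user"]
CRITERIA = ["depth", "accessibility", "procedural_detail"]

def score_summary(summary, judge):
    # Fill the persona-by-criterion matrix with judgments from `judge`,
    # any callable mapping a prompt string to a float in [0, 1].
    return {
        (p, c): judge(f"As a {p}, rate the {c} of this summary "
                      f"from 0 to 1:\n{summary}")
        for p in PERSONAS for c in CRITERIA
    }

def diversity_coverage(matrix):
    # DCI-style aggregate (an assumption): coverage rewards high average
    # scores, diversity rewards judges that actually separate personas.
    per_persona = [mean(matrix[(p, c)] for c in CRITERIA) for p in PERSONAS]
    return mean(per_persona) * (1 + max(per_persona) - min(per_persona))

# Example with a stub judge:
# dci = diversity_coverage(score_summary("...", judge=lambda prompt: 0.5))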


DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment

Ma, Zhichao, Huang, Fan, Zhao, Lu, Guo, Fengjun, Zhai, Guangtao, Min, Xiongkuo

arXiv.org Artificial Intelligence

Document image quality assessment (DIQA) is an important component of various applications, including optical character recognition (OCR), document restoration, and the evaluation of document image processing systems. In this paper, we introduce DIQA-5000, a subjective DIQA dataset comprising 5,000 document images generated by applying multiple document enhancement techniques to 500 real-world images with diverse distortions. Each enhanced image was rated by 15 subjects across three rating dimensions: overall quality, sharpness, and color fidelity. Furthermore, we propose a specialized no-reference DIQA model that exploits document layout features to preserve quality perception at reduced resolutions, lowering computational cost. Recognizing that image quality is influenced by both low-level and high-level visual features, we designed a feature fusion module to extract and integrate multi-level features from document images. To generate multi-dimensional scores, our model employs independent quality heads for each dimension to predict score distributions, allowing it to learn distinct aspects of document image quality. Experimental results demonstrate that our method outperforms current state-of-the-art general-purpose IQA models on both DIQA-5000 and an additional document image dataset focused on OCR accuracy.
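
A sketch of the described model shape, assuming PyTorch: a toy convolutional stem yields low- and high-level feature maps, a fusion layer combines their pooled embeddings, and independent heads predict a per-dimension score distribution. The layer sizes, five-bin discretization, and backbone are assumptions, not the paper's network.

import torch
import torch.nn as nn

class MultiHeadDIQA(nn.Module):
    def __init__(self, dims=("overall", "sharpness", "color_fidelity"), bins=5):
        super().__init__()
        self.low = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU())    # low-level cues
        self.high = nn.Sequential(nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU())  # layout-level cues
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fuse = nn.Linear(32 + 64, 128)  # fuse multi-level features
        self.heads = nn.ModuleDict({d: nn.Linear(128, bins) for d in dims})

    def forward(self, x):
        f_low = self.low(x)
        f_high = self.high(f_low)
        z = torch.cat([self.pool(f_low).flatten(1),
                       self.pool(f_high).flatten(1)], dim=1)
        z = torch.relu(self.fuse(z))
        # One score distribution per rating dimension.
        return {d: head(z).softmax(-1) for d, head in self.heads.items()}

A scalar score per dimension is then the distribution's expected value, e.g. (dist * torch.arange(1, 6)).sum(-1) for five bins.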


Evaluating Quality of Gaming Narratives Co-created with AI

Valdivia, Arturo, Burelli, Paolo

arXiv.org Artificial Intelligence

This paper proposes a structured methodology to evaluate AI-generated game narratives, leveraging the Delphi study structure with a panel of narrative design experts. Our approach synthesizes story quality dimensions from the literature and expert insights, mapping them into the Kano model framework to understand their impact on player satisfaction. The results can inform game developers on prioritizing quality aspects when co-creating game narratives with generative AI. While generative AI has surged into public and research consciousness following the release of systems like ChatGPT, video games have a longer tradition of using AI techniques to generate content that would otherwise be authored by human designers. This tradition is well established in the field of Procedural Content Generation, which encompasses a range of methods for algorithmically creating game elements such as levels, characters, quests, and storylines [1].
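
The standard Kano evaluation table onto which such dimensions are mapped is compact enough to code directly; the sketch below classifies one story-quality dimension from paired functional/dysfunctional survey answers (the answer wording and majority-vote aggregation are assumptions, not the paper's protocol).

from collections import Counter

ANSWERS = ["like", "must_be", "neutral", "live_with", "dislike"]
# Rows: functional answer; columns: dysfunctional answer (standard Kano table).
ROWS = {"like": "QAAAO", "must_be": "RIIIM", "neutral": "RIIIM",
        "live_with": "RIIIM", "dislike": "RRRRQ"}
CATEGORY = {"A": "attractive", "O": "one-dimensional", "M": "must-be",
            "I": "indifferent", "R": "reverse", "Q": "questionable"}

def kano_category(functional, dysfunctional):
    return CATEGORY[ROWS[functional][ANSWERS.index(dysfunctional)]]

def classify_dimension(responses):
    # Majority category for one quality dimension, given (functional,
    # dysfunctional) answer pairs from the expert panel.
    return Counter(kano_category(f, d) for f, d in responses).most_common(1)[0][0]

print(classify_dimension([("like", "dislike"), ("like", "neutral"),
                          ("like", "dislike")]))  # -> "one-dimensional"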


Geopolitical Parallax: Beyond Walter Lippmann Just After Large Language Models

Yavuz, Mehmet Can, Kabir, Humza Gohar, Özkan, Aylin

arXiv.org Artificial Intelligence

Objectivity in journalism has long been contested, oscillating between ideals of neutral, fact-based reporting and the inevitability of subjective framing. With the advent of large language models (LLMs), these tensions are now mediated by algorithmic systems whose training data and design choices may themselves embed cultural or ideological biases. This study investigates geopolitical parallax, the systematic divergence in news quality and subjectivity assessments, by comparing article-level embeddings from Chinese-origin (Qwen, BGE, Jina) and Western-origin (Snowflake, Granite) model families. We evaluate both families on a human-annotated news quality benchmark spanning fifteen stylistic, informational, and affective dimensions, and on parallel corpora covering politically sensitive topics, including Palestine and reciprocal China-United States coverage. Using logistic regression probes and matched-topic evaluation, we quantify per-metric differences in predicted positive-class probabilities between model families. Our findings reveal consistent, non-random divergences aligned with model origin. In Palestine-related coverage, Western models assign higher subjectivity and positive-emotion scores, while Chinese models emphasize novelty and descriptiveness. Cross-topic analysis shows asymmetries in structural quality metrics, with Chinese-on-US coverage scoring notably lower in fluency, conciseness, technicality, and overall quality, contrasted by higher negative-emotion scores. These patterns align with media bias theory and our distinction between semantic, emotional, and relational subjectivity, and extend the LLM bias literature by showing that geopolitical framing effects persist in downstream quality assessment tasks. We conclude that LLM-based media evaluation pipelines require cultural calibration to avoid conflating content differences with model-induced bias.
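
The probing step reduces to a few lines, assuming scikit-learn and precomputed article embeddings; the function names and synthetic demo data are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def metric_probe(train_emb, train_labels):
    # One probe per quality metric: binary label = article judged positive
    # on that metric in the human-annotated benchmark.
    return LogisticRegression(max_iter=1000).fit(train_emb, train_labels)

def parallax(probe_a, emb_a, probe_b, emb_b):
    # Per-metric divergence between families: gap in mean predicted
    # positive-class probability on the same matched-topic articles,
    # each embedded by its own model family.
    return (probe_a.predict_proba(emb_a)[:, 1].mean()
            - probe_b.predict_proba(emb_b)[:, 1].mean())

# Toy demo on synthetic "embeddings":
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] > 0).astype(int)
probe = metric_probe(X[:150], y[:150])
print(parallax(probe, X[150:], probe, X[150:] + 0.2))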


LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought

Qi, Ruiyan, Wen, Congding, Zhou, Weibo, Li, Jiwei, Liang, Shangsong, Li, Lingbo

arXiv.org Artificial Intelligence

Evaluating large language models (LLMs) in specific domains like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$abel-Free $\textbf{E}$valuation of LLMs on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures, instead of labeled data, to assess LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT, with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close the gap; (2) for sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work establishes a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
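
Label-free evaluation of this kind amounts to walking an expert criteria tree with an LLM judge. The sketch below assumes a two-level tree and a judge callable returning scores in [0, 1]; the facets, leaf questions, and uniform weighting are invented for illustration, not the paper's ToT.

# Internal nodes = quality facets, leaves = checkable criteria (hypothetical).
EXPERT_TOT = {
    "accuracy": {
        "facts": "Are named places, prices, and opening hours plausible?",
        "grounding": "Does the answer avoid invented attractions?",
    },
    "usefulness": {
        "actionability": "Is the advice actionable for trip planning?",
        "conciseness": "Is the answer free of filler?",
    },
}

def lettot_score(question, answer, judge, tree=EXPERT_TOT):
    # Average leaf judgments within each facet, then average the facets.
    facet_scores = []
    for leaves in tree.values():
        scores = [judge(f"{q} Question: {question} Answer: {answer}")
                  for q in leaves.values()]
        facet_scores.append(sum(scores) / len(scores))
    return sum(facet_scores) / len(facet_scores)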


From Logic to Language: A Trust Index for Problem Solving with LLMs

Rug, Tehseen, Böhmer, Felix, Pfattheicher, Tessa

arXiv.org Artificial Intelligence

Classical computation, grounded in formal, logical systems, has been the engine of technological progress for decades, excelling at problems that can be described with unambiguous rules. This paradigm, however, leaves a vast ocean of human problems -- those characterized by ambiguity, dynamic environments, and subjective context -- largely untouched. The advent of Large Language Models (LLMs) represents a fundamental shift, enabling computational systems to engage with this previously inaccessible domain using natural language. This paper introduces a unified framework to understand and contrast these problem-solving paradigms. We define and delineate the problem spaces addressable by formal languages versus natural language. While solutions to the former problem class can be evaluated using binary quality measures, the latter requires a much more nuanced definition of approximate solution space taking into account the vagueness, subjectivity and ambiguity inherent to natural language. We therefore introduce a vector-valued trust index Q, which reflects solution quality and distinguishes the binary correctness of formal solutions from the continuous adequacy spectrum characteristic of natural language solutions. Within this framework, we propose two statistical quality dimensions. Normalized bi-semantic entropy measures robustness and conceptual diversity of LLM answers given semantic variation in problem formulations. Emotional valence maps subjective valuation of a solution to a quantifiable metric that can be maximized by invoking statistical measures. The concepts introduced in this work will provide a more rigorous understanding of the capabilities, limitations, and inherent nature of problem-solving in the age of LLMs.
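
The entropy component can be sketched directly. Assuming answers to semantically varied formulations of one problem have already been clustered by meaning (the clustering itself, via embeddings or entailment, is not shown), a normalized entropy over cluster sizes measures robustness; whether this matches the paper's exact bi-semantic normalization is an assumption.

import math
from collections import Counter

def normalized_entropy(cluster_ids):
    # cluster_ids: one semantic-cluster label per LLM answer, collected
    # across paraphrased problem formulations.
    counts = Counter(cluster_ids).values()
    n = sum(counts)
    h = -sum((c / n) * math.log(c / n) for c in counts)
    return h / math.log(n) if n > 1 else 0.0

print(normalized_entropy(["a"] * 8))         # 0.0: answers fully consistent
print(normalized_entropy(list("abcdefgh")))  # 1.0: maximally diverse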


Beyond Accuracy: What Matters in Designing Well-Behaved Models?

Hesse, Robin, Bağcı, Doğukan, Schiele, Bernt, Schaub-Meyer, Simone, Roth, Stefan

arXiv.org Artificial Intelligence

Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect the quality dimensions. We reveal various new insights, such as: (i) vision-language models exhibit high fairness on ImageNet-1k classification and strong robustness against domain changes; (ii) self-supervised learning is an effective training paradigm to improve almost all considered quality dimensions; and (iii) training dataset size is a major driver of most quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
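
An aggregate in the spirit of QUBA can be sketched as a weighted average of min-max-normalized dimension scores; the published QUBA definition may differ, and the weights here merely stand in for "specific user needs".

import numpy as np

def quba_like(scores, weights=None):
    # scores: (n_models, n_dims), higher = better on each quality
    # dimension (accuracy, robustness, calibration, fairness, ...).
    s = np.asarray(scores, dtype=float)
    s = (s - s.min(axis=0)) / (s.max(axis=0) - s.min(axis=0) + 1e-12)
    if weights is None:
        weights = np.full(s.shape[1], 1.0)
    w = np.asarray(weights, dtype=float)
    return s @ (w / w.sum())  # one aggregate score per model

# A fairness-focused user would up-weight that dimension, e.g.
# quba_like(scores, weights=[1, 1, 1, 3]).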


Natural Language Generation

van Miltenburg, Emiel, Lin, Chenghua

arXiv.org Artificial Intelligence

This article provides a brief overview of the field of Natural Language Generation. The term Natural Language Generation (NLG), in its broadest definition, refers to the study of systems that verbalize some form of information through natural language. That information could be stored in a large database or knowledge graph (in data-to-text applications), but NLG researchers may also study summarisation (text-to-text) or image captioning (image-to-text), for example. As a subfield of Natural Language Processing, NLG is closely related to other sub-disciplines such as Machine Translation (MT) and Dialog Systems. Some NLG researchers exclude MT from their definition of the field, since it involves no content selection: there is no stage where the system has to determine what to say. Conversely, dialog systems do not typically fall under the heading of Natural Language Generation, since NLG is just one component of dialog systems (the others being Natural Language Understanding and Dialog Management). However, with the rise of Large Language Models (LLMs), different subfields of Natural Language Processing have converged on similar methodologies for the production of natural language and the evaluation of automatically generated text.


AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation and Benchmark

Li, Lan, Fang, Liri, Torvik, Vetle I.

arXiv.org Artificial Intelligence

We investigate the reasoning capabilities of large language models (LLMs) for automatically generating data-cleaning workflows. To evaluate LLMs' ability to complete data-cleaning tasks, we implemented a pipeline for LLM-based Auto Data Cleaning Workflow (AutoDCWorkflow), prompting LLMs with data-cleaning operations to repair three types of data quality issues: duplicates, missing values, and inconsistent data formats. Given a dirty table and a purpose (expressed as a query), the pipeline generates a minimal clean table sufficient to address the purpose, together with the data-cleaning workflow used to produce it. The planning process involves three main LLM-driven components: (1) Select Target Columns: identifies the set of target columns related to the purpose. (2) Inspect Column Quality: assesses the data quality of each target column and generates a Data Quality Report as operation objectives. (3) Generate Operation & Arguments: predicts the next operation and its arguments based on the Data Quality Report. Additionally, we propose a data-cleaning benchmark to evaluate the ability of LLM agents to automatically generate workflows that address data-cleaning purposes of varying difficulty levels. The benchmark comprises annotated datasets, each consisting of a purpose, a raw table, a clean table, a data-cleaning workflow, and an answer set. In our experiments, we evaluated three LLMs that auto-generate purpose-driven data-cleaning workflows. The results indicate that LLMs perform well in planning and generating data-cleaning workflows without the need for fine-tuning.
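
The three LLM-driven stages compose into a short loop, given any prompt-to-text callable; the prompts, DONE sentinel, and parsing below are illustrative, and executing the predicted operations (e.g., OpenRefine-style transforms) is omitted.

def auto_dc_workflow(table, purpose, llm, max_steps=10):
    # Stage 1: Select Target Columns.
    cols = llm(f"Purpose: {purpose}. Columns: {list(table)}. "
               "Reply with the target columns, comma-separated.").split(",")
    workflow = []
    for _ in range(max_steps):
        # Stage 2: Inspect Column Quality (a textual Data Quality Report).
        report = llm(f"Report duplicates, missing values, and format "
                     f"inconsistencies for columns {cols} in {table}.")
        # Stage 3: Generate Operation & Arguments, or stop.
        step = llm(f"Given the report: {report}\n"
                   "Reply with the next operation and its arguments, or DONE.")
        if step.strip() == "DONE":
            break
        workflow.append(step)
        # (Applying the operation to `table` is omitted in this sketch.)
    return workflow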


Data Quality Awareness: A Journey from Traditional Data Management to Data Science Systems

Dong, Sijie, Sahri, Soror, Palpanas, Themis

arXiv.org Artificial Intelligence

Artificial intelligence (AI) has transformed various fields, significantly impacting our daily lives. A major factor in AI's success is high-quality data. In this paper, we present a comprehensive review of the evolution of data quality (DQ) awareness, from traditional data management systems to the modern data-driven AI systems that are integral to data science. We synthesize the existing literature, highlighting the quality challenges and techniques that have evolved from traditional data management to data science, including the big data and ML fields. As data science systems support a wide range of activities, our focus in this paper lies specifically in the analytics aspect driven by machine learning. We use the cause-effect connection between the quality challenges of ML and those of big data to allow a more thorough understanding of emerging DQ challenges and the related quality-awareness techniques in data science systems. To the best of our knowledge, our paper is the first to provide a review of DQ awareness spanning both traditional and emergent data science systems. We hope that readers will find this journey through the evolution of data quality awareness insightful and valuable.